Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing
Author
Abstract
The Constrained Run-Length Algorithm (CRLA) is a well-known technique for page segmentation. The algorithm is very efficient at partitioning documents with Manhattan layouts, but it is not suited to complex page layouts, e.g. irregular graphics embedded in a text paragraph. Its main drawback is that it uses only local information during the smearing stage, which may lead to erroneous linkage of text and graphics. This paper addresses the problem by adding global information to the CRLA process. The enhanced CRLA can be applied successfully to non-Manhattan page layouts, and it can also extract text surrounded by a box. Neither case can be handled by the original CRLA.
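For orientation, the smearing step that the abstract refers to can be sketched as below. This is a minimal illustration of classic run-length smearing on a binary image (1 = black, 0 = white), not the enhanced method of the paper. The function names `smear_row` and `smear_image` and the thresholds `c_h` and `c_v` are assumptions made for the example. It shows why the decision is purely local: a white gap is filled based only on its own length, regardless of whether the pixels on either side belong to text or to graphics.

```python
def smear_row(row, threshold):
    """Fill white runs (0s) of length <= threshold that lie between two black pixels (1s)."""
    out = list(row)
    last_black = None  # index of the most recent black pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= threshold:
                    # The gap is short enough: link the two black pixels.
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def smear_image(image, c_h, c_v):
    """Smear rows with c_h and columns with c_v, then combine with a logical AND."""
    horizontal = [smear_row(r, c_h) for r in image]
    columns = [list(col) for col in zip(*image)]
    vertical_t = [smear_row(c, c_v) for c in columns]
    vertical = [list(r) for r in zip(*vertical_t)]
    return [[h & v for h, v in zip(hr, vr)]
            for hr, vr in zip(horizontal, vertical)]


# A gap of 2 is filled with threshold 3; a gap of 5 is left open.
print(smear_row([1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1], 3))
```

With a larger threshold the second gap would also be filled, which is the kind of text-graphics linkage the abstract describes when an irregular graphic sits close to a paragraph.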
Similar resources
Selective CRLA based Layout Analysis and Text Region Extraction from Low Quality Document Images
This paper aims to detect textual regions by separating them from graphical regions, using a Selective CRLA scheme and statistical textual properties, on noisy, low-resolution newspaper images. A bottom-up approach is adopted: the Selective Constrained Run-Length Algorithm (CRLA) is applied to obtain the layout, and a region-growing method applied over it segments the homogeneous regions. Statistical...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system that segments and classifies the contents of Arabic document images. The system includes preprocessing, document segmentation, feature extraction and document classification. In preprocessing, the document image is enhanced by noise removal, binarization, and skew detection and correction. In document segmentation, an algorith...
Segmentation of historical machine-printed documents using Adaptive Run Length Smoothing and skeleton segmentation paths
In this paper, we strive towards the development of efficient techniques for segmenting document pages resulting from the digitization of historical machine-printed sources. Documents of this kind often suffer from low quality and local skew, show several degradations due to the quality of the old printing matrix or to ink diffusion, and exhibit complex and dense layouts. To face these problems, we introd...
Skew detection for complex document images using robust borderlines in both text and non-text regions
(doi:10.1016/j.patrec.2008.06.008) A new skew detection method for complex document images based on robust borderlines extracted from both text and non-text regions is proposed in this paper. First, bor...
Newspaper Headlines Extraction from Microfilm Images
Automatic indexing is important for a digital library to provide digitized manuscripts of old document images and their electronic text. As an essential step in creating such a system, this paper discusses the issue of extracting headlines from old newspaper microfilms. Most research on document layout analysis has largely assumed relatively clean images. However, microfilm images of old newspap...
Journal title:
Volume, issue:
Pages: -
Publication date: 2007